XGBoost Hyperparameter tuning: XGBRegressor (XGBoost Regression)
Published:
XGBoost (Extreme Gradient Boosting) is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is a leading machine learning library for regression, classification, and ranking problems (Nvidia).
In this tutorial, we will discuss regression using XGBoost. We will develop an end-to-end pipeline using scikit-learn's Pipeline() and ColumnTransformer(), and we will tune the hyperparameters of XGBRegressor() inside the pipeline.
Additionally, we will discuss feature engineering on the NASA airfoil self-noise dataset from the UCI ML repository. You can download the data using the following link.
DataLink: https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat
Donor:
Dr Roberto Lopez
robertolopez ‘@’ intelnics.com
Intelnics
Creators:
Thomas F. Brooks, D. Stuart Pope and Michael A. Marcolini NASA
Data Set Information:
The NASA data set comprises measurements of different-size NACA 0012 airfoils at various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position were the same in all of the experiments.
Attribute Information:
This problem has the following inputs:
- Frequency, in Hertz.
- Angle of attack, in degrees.
- Chord length, in meters.
- Free-stream velocity, in meters per second.
- Suction side displacement thickness, in meters.
The only output is:
- Scaled sound pressure level, in decibels.
What will you learn?
Different EDA techniques: histogram, Q-Q plot, heatmap and correlation plot, box plot
Data preprocessing and feature transformation: Box-Cox transformation, QuantileTransformer, KBinsDiscretizer, etc.
Scikit-learn pipelines with ColumnTransformer
XGBoost regression with scikit-learn pipelines and ColumnTransformer
Hyperparameter tuning for XGBRegressor() using scikit-learn pipelines
Different regression metrics: r2_score, MAE, MSE
Bonus: the sweetviz library
How to tune XGBRegressor() using RandomizedSearchCV()
- Download the data and install xgboost.
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat
!pip install xgboost
--2022-03-09 15:04:32-- https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 59984 (59K) [application/x-httpd-php]
Saving to: ‘airfoil_self_noise.dat’
airfoil_self_noise. 100%[===================>] 58.58K --.-KB/s in 0.1s
2022-03-09 15:04:33 (417 KB/s) - ‘airfoil_self_noise.dat’ saved [59984/59984]
Requirement already satisfied: xgboost in /usr/local/lib/python3.7/dist-packages (0.90)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from xgboost) (1.4.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from xgboost) (1.21.5)
- Read the downloaded data into a pandas DataFrame.
import pandas as pd
df = pd.read_table("airfoil_self_noise.dat", header = None)
df.columns =['freq', 'angle','chord','velocity','thickness','soundpressure']
df.head()
| | freq | angle | chord | velocity | thickness | soundpressure |
|---|---|---|---|---|---|---|
| 0 | 800 | 0.0 | 0.3048 | 71.3 | 0.002663 | 126.201 |
| 1 | 1000 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.201 |
| 2 | 1250 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.951 |
| 3 | 1600 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.591 |
| 4 | 2000 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.461 |
- To prevent leakage between the train and test data, let's first split the data into train and test sets using scikit-learn's train_test_split().
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['soundpressure'], axis=1), df['soundpressure'],
    test_size=0.2, random_state=0)
Exploratory data analysis (EDA) on airfoil data.
- Since it is a regression problem, let’s plot the histogram and QQ-plot to visualize data distribution.
Linear models assume that the independent variables are normally distributed; if this assumption is not met, these algorithms can produce poor results.
We can determine whether a variable is normally distributed with:
- Histograms and
- Q-Q plots
A histogram is a graphical representation of the distribution of the data, whereas in a Q-Q plot the quantiles of the independent variable are plotted against the expected quantiles of the normal distribution. If the variable is normally distributed, the dots in the Q-Q plot should fall along a 45-degree diagonal.
import scipy.stats as stats
import matplotlib.pyplot as plt

def diagnostic_plots(df, variable):
    # Plot a histogram and a Q-Q plot side by side for a single feature
    print("\n\n Feature name is : {}\n".format(variable))
    plt.figure(figsize=(15, 6))
    plt.subplot(1, 2, 1)
    plt.title("Histogram")
    df[variable].hist(bins='auto')
    plt.subplot(1, 2, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.show()

for col in X_train.columns:
    diagnostic_plots(X_train, col)
[Histogram and Q-Q plot pairs are displayed for each feature: freq, angle, chord, velocity, and thickness.]
The histograms above show that the velocity and chord features take only a few distinct values, so they behave like categorical features. Let's check the unique values in these columns.
X_train.velocity.nunique(), X_test.velocity.nunique()
(4, 4)
X_train.velocity.unique()
array([71.3, 55.5, 39.6, 31.7])
X_train.chord.nunique(), X_test.chord.nunique()
(6, 6)
X_train.chord.unique()
array([0.0254, 0.2286, 0.1524, 0.1016, 0.3048, 0.0508])
The velocity column has four unique values, whereas the chord column has six. You can also see this in the histograms and Q-Q plots.
We could apply label encoding to these features because they represent ordinal data, or we could use them as-is, since tree-based methods don't usually need feature scaling or transformation. However, I would like to introduce another way to encode these data: KBinsDiscretizer().
Why KBinsDiscretizer()?
Velocity and chord might change over time, and KBinsDiscretizer() can discretize the data based on clustering and encode the bins in an ordinal fashion. In this transformation, we will use the kmeans strategy to cluster the data and assign ordinal values. This approach works well when the data is clustered around a small number of centroids. We will use four centroids for velocity and six centroids for the chord feature.
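As a quick standalone illustration (a minimal sketch of my own, not part of the final pipeline), this is how KBinsDiscretizer() with the kmeans strategy could bin the velocity feature into four ordinal values:

from sklearn.preprocessing import KBinsDiscretizer

# Bin velocity into 4 clusters found by k-means and encode each bin as an ordinal label 0-3
kbins = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans')
vel_binned = kbins.fit_transform(X_train[['velocity']])
print(kbins.bin_edges_[0])     # learned bin edges around the four velocity clusters
print(vel_binned[:5].ravel())  # first few rows mapped to their ordinal bin labels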
What about other features?
The freq feature is not normally distributed: its histogram is skewed, and its Q-Q plot does not fall along the 45-degree diagonal. Therefore we need to transform this numerical feature. There are many feature transformation techniques. Some of them are:
- Transforming variables with the logarithm
- Transforming variables with the reciprocal function
- Using square and cube root to transform variables
- Using power transformations on numerical variables
- Box-Cox transformation on numerical variables
- Yeo-Johnson transformation on numerical variables
A simple generalization of both the square root transform and the log transform is known as the Box-Cox transform. We will use this approach first and see the result. If the result is satisfactory we will move on; if not, we will try another approach.
import seaborn as sns

# stats.boxcox returns the transformed values and the fitted lambda
train_freq, freq_lambda = stats.boxcox(X_train['freq'])
sns.distplot(train_freq)
Wow! The data now looks approximately normally distributed.
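For reference (this quick check is my own addition, not from the original analysis), the Box-Cox transform is (x**lmbda - 1) / lmbda when lmbda != 0 and log(x) when lmbda == 0, so we can reproduce one transformed value by hand using the fitted freq_lambda:

import numpy as np

# Reproduce the first transformed value manually with the fitted lambda
x0 = X_train['freq'].iloc[0]
manual = np.log(x0) if freq_lambda == 0 else (x0**freq_lambda - 1) / freq_lambda
print(manual, train_freq[0])  # the two values should match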
Now let’s visualize the correlation between the features on a heatmap.
plt.figure(figsize=(8,5))
sns.heatmap(X_train.corr(), annot = True)
The angle and thickness features are highly correlated (correlation coefficient = 0.75), so we will drop the angle column.
X_train.drop(['angle'], axis = 1, inplace =True)
X_test.drop(['angle'], axis = 1, inplace = True)
The thickness column is also highly skewed and contains outliers. Therefore we will apply QuantileTransformer() to this feature. You can learn more about QuantileTransformer() in the scikit-learn documentation.
QuantileTransformer() transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
sns.boxplot(x = X_train['thickness'])
plt.title("Box plot showing outliers")
plt.show()
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer()
scaler.fit(X_train[['thickness']])
train_thickness = scaler.transform(X_train[['thickness']]).flatten()
sns.distplot(train_thickness)
sns.boxplot(x = train_thickness)
stats.probplot(train_thickness, dist="norm", plot=plt)
The data is now approximately normally distributed, and the outliers have been handled.
Feature Engineering and Transformation on airfoil data based on EDA
The training data contains the [freq, chord, velocity, thickness] features. Based on the EDA, we have to apply the following transformation to each feature:
- freq: Box-Cox transformation
- chord: KBinsDiscretizer with 6 bins
- velocity: KBinsDiscretizer with 4 bins
- thickness: QuantileTransformer
To apply an individual transformation to each feature, we need scikit-learn's ColumnTransformer().
- Let’s first import all the necessary libraries.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import xgboost as xgb
X_train.head(2)
| | freq | chord | velocity | thickness |
|---|---|---|---|---|
| 1058 | 800 | 0.0254 | 71.3 | 0.004207 |
| 408 | 315 | 0.2286 | 55.5 | 0.011171 |
- Apply the ColumnTransformer to each column. Remember, we have to specify the column index to tell the transformer which transformation to apply to which column. Here [0] means freq, [1] means chord, and so on.
transformer = ColumnTransformer(transformers=[
        ('freq', PowerTransformer(method='box-cox', standardize=False), [0]),
        ('chord', KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='kmeans'), [1]),
        ('vel', KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans'), [2]),
        ('thickness', QuantileTransformer(), [3]),
    ],
    remainder='passthrough')
- Now, we have to apply XGBoost regression to our data. Hence, we need to integrate the XGBRegressor() and ColumnTransformer() objects in a pipeline, as shown below:
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[("preprocessor", transformer),
                       ("model", xgb.XGBRegressor(objective='reg:squarederror',
                                                  learning_rate=0.1,
                                                  n_estimators=400,
                                                  max_depth=3,
                                                  seed=0))])
The above approach might not give the best results because the hyperparameters are hard-coded. Therefore, we need to tune hyperparameters like learning_rate, n_estimators, max_depth, etc.
The two easy ways to tune hyperparameters are GridSearchCV() and RandomizedSearchCV(). Since RandomizedSearchCV() is quick and efficient, we will use this approach here.
- We will enclose the Pipeline() inside RandomizedSearchCV(), pass the necessary hyperparameters, and cross-validate the results. Here we will track the r2_score.
from sklearn.model_selection import RandomizedSearchCV

hyperparameter_grid = {
    'model__n_estimators': [100, 400, 800],
    'model__max_depth': [3, 6, 9],
    'model__learning_rate': [0.05, 0.1, 0.20],
}

pipeline = RandomizedSearchCV(
    Pipeline(steps=[("preprocessor", transformer),
                    ("model", xgb.XGBRegressor(objective='reg:squarederror', seed=0))]),
    param_distributions=hyperparameter_grid,
    n_iter=20,
    scoring='r2',
    n_jobs=-1,
    cv=7,
    verbose=3)
The model__ prefix is given before each hyperparameter because the XGBRegressor() step in the pipeline is named model.
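If you are ever unsure about the exact parameter names, one way to check (a small sketch of my own, not from the original notebook) is to list the parameters that the pipeline exposes for the model step:

# List the tunable parameter names that the search accepts for the "model" step
inner = Pipeline(steps=[("preprocessor", transformer),
                        ("model", xgb.XGBRegressor(objective='reg:squarederror', seed=0))])
print([name for name in inner.get_params() if name.startswith("model__")][:8])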
- Let’s fit the entire pipeline on the train set.
pipeline.fit(X_train, y_train)
Fitting 7 folds for each of 20 candidates, totalling 140 fits
What are the best hyperparameters?
hyperparam = pipeline.best_params_
print("The best Hyperparameters for XGBRegressor are: {}".format(hyperparam))
The best Hyperparameters for XGBRegressor are: {'model__n_estimators': 800, 'model__max_depth': 9, 'model__learning_rate': 0.05}
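As a follow-up (my own addition, relying on the standard RandomizedSearchCV API), best_score_ holds the mean cross-validated r2_score of the best candidate:

# Mean cross-validated r2 of the best hyperparameter combination
print("Best cross-validated r2: {:.4f}".format(pipeline.best_score_))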
What is the accuracy of the model? (For a regressor, the pipeline's score() method returns the r2_score on the test set.)
print("Accuracy = {} ".format(pipeline.score(X_test, y_test)))
Accuracy = 0.9586400481884366
Other Regression metrics
ypred = pipeline.predict(X_test)

# scikit-learn metrics expect the argument order (y_true, y_pred)
print("r2_score : ", r2_score(y_test.values, ypred))
print("MSE: ", mean_squared_error(y_test.values, ypred))
print("MAE: ", mean_absolute_error(y_test.values, ypred))
r2_score : 0.9544870993856739
MSE: 1.9455311810660134
MAE: 0.9118466592072646
Bonus
!pip install sweetviz
Collecting sweetviz
Downloading sweetviz-2.1.3-py3-none-any.whl (15.1 MB)
[K |████████████████████████████████| 15.1 MB 13.8 MB/s
Installing collected packages: sweetviz
Successfully installed sweetviz-2.1.3
We can compare the distribution of the data in the train and test sets using sweetviz.
import sweetviz as sv
report = sv.compare([X_train, "train data"], [X_test, "test data"])
report.show_html()
report.show_notebook(w="100%", h="full") # if working in colab
Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
Click this link to see the output: Link SweetViz Output
References
Galli, S. (2020). Python feature engineering cookbook: over 70 recipes for creating, engineering, and transforming features to build machine learning models. Packt Publishing Ltd.
Zheng, A., & Casari, A. (2018). Feature engineering for machine learning: principles and techniques for data scientists. O’Reilly Media, Inc.
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py